Nonparametric Assessment of Contamination in Multivariate Data Using Minimum Volume Sets and FDR

Authors

  • Clayton Scott
  • Eric Kolaczyk
Abstract

Large, multivariate datasets from high-throughput instrumentation have become ubiquitous throughout the sciences. Frequently, it is of great interest to characterize the measurements in these datasets by the extent to which they represent ‘nominal’ versus ‘contaminated’ instances. Often, however, the nature of even the nominal patterns in the data is unknown and potentially quite complex, which makes explicit parametric modeling of those patterns a daunting task. In this paper, we introduce a nonparametric method for the simultaneous annotation of multivariate data (called MN-SCAnn). The method produces an annotated ranking of the observations that indicates the relative extent to which each may or may not be considered nominal, while making minimal assumptions about the nominal distribution. In our framework, each observation is linked to a corresponding minimum volume set and, implicitly adopting a hypothesis testing perspective, each set is associated with a test, which in turn is accompanied by a certain false discovery rate. The combination of minimum volume set methods with false discovery rate principles, in the context of contaminated data, is new. Moreover, estimation of the key underlying quantities requires that a number of issues be addressed. We illustrate MN-SCAnn through examples in two contexts: the pre-processing of cell-based assays in bioinformatics, and the detection of anomalous traffic patterns in Internet measurement studies.
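
The link between minimum volume sets, tests, and false discovery rates described in the abstract can be illustrated with a small sketch. This is not the MN-SCAnn estimator. It is a simplified stand-in built on assumptions that the abstract does not state: a separate reference sample known to be nominal is available, minimum volume sets are approximated by level sets of a Gaussian kernel density estimate, and false discovery rates are summarized by Benjamini-Hochberg adjusted values. The function names (`kde_score`, `empirical_pvalues`, `bh_qvalues`) and the bandwidth are hypothetical choices made for illustration.

```python
import math
import random


def kde_score(point, train, bandwidth=1.0):
    """Unnormalized Gaussian kernel density score of `point` under `train`."""
    total = 0.0
    for t in train:
        sq_dist = sum((a - b) ** 2 for a, b in zip(point, t))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / len(train)


def empirical_pvalues(observations, train, calib, bandwidth=1.0):
    """Assumed setup: `train` and `calib` are disjoint nominal samples.

    The smallest density level set containing x leaves out the nominal
    points whose density is at or below that of x; the empirical share of
    such calibration points serves as the p-value of x.
    """
    calib_scores = [kde_score(c, train, bandwidth) for c in calib]
    pvalues = []
    for x in observations:
        s = kde_score(x, train, bandwidth)
        n_le = sum(1 for c in calib_scores if c <= s)
        pvalues.append((1 + n_le) / (1 + len(calib_scores)))
    return pvalues


def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


if __name__ == "__main__":
    random.seed(0)
    # Synthetic nominal data: 2-D standard Gaussian (illustrative only).
    nominal = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
    train, calib = nominal[:100], nominal[100:]
    observations = [(0.0, 0.0), (0.5, -0.3), (8.0, 8.0), (-7.0, 6.0)]
    pvals = empirical_pvalues(observations, train, calib)
    qvals = bh_qvalues(pvals)
    # Annotated ranking: smallest adjusted value first (least nominal).
    for q, p, x in sorted(zip(qvals, pvals, observations)):
        print(x, round(p, 4), round(q, 4))
```

Sorting the observations by adjusted value gives an annotated ranking in the spirit of the abstract: observations far from the bulk of the reference sample appear first, each with an accompanying false discovery rate summary. Splitting the nominal sample into a fitting half and a calibration half keeps the calibration scores from being inflated by points that also shaped the density estimate.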

Similar resources

Nonparametric Assessment of Contamination in Multivariate Data Using Generalized Quantile Sets and FDR

Large, multivariate datasets from high-throughput instrumentation have become ubiquitous in the sciences. Frequently, it is of interest to characterize the measurements in these datasets by the extent to which they represent ‘nominal’ versus ‘contaminated’ instances. However, often the nature of even the nominal patterns in the data is unknown and potentially quite complex, making their explic...

Analysis of physiochemical and microbial quality of waters of the Karkheh River in southwestern Iran using multivariate statistical methods

Rapid population growth as well as agricultural and industrial development have increased the contamination of Iranian rivers. This study utilized principal components analysis (PCA) to determine the degree of significance of qualitative parameters of water resources in the Karkheh River in southwestern Iran. Cluster analysis (CA) grouped the monitoring stations based on the water quality data ...

Contamination and Risk Assessment of Total Aflatoxin in Iranian Rice in Various Food Security Regions

Background and Objectives: Aflatoxin is one of the most important and common toxins in high-consumption foods such as rice, and it can threaten the health of consumers. In this study, the risk of total aflatoxin was assessed in Iranian rice based on its exposure levels and adverse effects. Materials & Methods: This study was carried out on total aflatoxin in 60 national rice samples collected base...

Assessment of an ore body internal dilution based on multivariate geostatistical simulation using exploratory drill hole data

Dilution can best be defined as the proportion of waste tonnage to the total weight of ore and waste in each block. Predicting the internal dilution based on geological boundaries of waste and ore in each block can help engineers to develop more reliable long-term planning designs in mining activities. This paper presents a method to calculate the geological internal dilution in each block and ...

Mann - Withney multivariate nonparametric control chart.

In many quality control applications, the necessary distributional assumptions to correctly apply the traditional parametric control charts are either not met or there is simply not enough information or evidence to verify the assumptions. It is well known that performance of many parametric control charts can be seriously degraded in situations like this. Thus, control charts that do not requi...

Publication date: 2007